Using MT-Based Metrics for RTE
Authors
Abstract
We analyse the complexity of the RTE task data and divide the T/H pairs into three classes, depending on the type of knowledge required to solve the problem. We then propose an approach suitable for the two easier classes, which account for two thirds of all pairs. Our assumption is that T and H are translations of the same source sentence, so we use an MT evaluation metric (Meteor) to judge the similarity of the two translations. Clearly, in most cases where T entails H, T and H do not have exactly the same meaning; however, we observe that the similarity is still much higher for positive T/H pairs than for negative ones. We achieve a macro-average F1-score of 46.34 for the task. On the one hand, this shows that our approach has its weaknesses, mainly because the assumption that T and H carry the same meaning does not always hold, especially when T and H have very different lengths. On the other hand, considering that RTE-7 is a difficult, class-imbalanced problem (<5% YES, >95% NO), this robust approach achieves a decent result for a large amount of data: it is above the median of this year's results and is comparable with the top results from the previous years.
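The core of the approach can be sketched in a few lines: treat T as a reference translation and H as a candidate translation, score the pair with an MT metric, and predict entailment when the score exceeds a threshold. The sketch below uses NLTK's Meteor implementation as a stand-in for the Meteor toolkit used by the authors; the threshold value and the example pair are invented for illustration and are not taken from the paper.

```python
# Minimal sketch: score H against T with an MT metric (NLTK's Meteor
# implementation as a stand-in) and threshold the score to decide
# entailment. The threshold 0.35 is an illustrative assumption.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # Meteor's synonym matching needs WordNet


def entails(text: str, hypothesis: str, threshold: float = 0.35) -> bool:
    # Recent NLTK versions expect pre-tokenized input.
    score = meteor_score([text.split()], hypothesis.split())
    return score >= threshold


t = "Google bought YouTube for 1.65 billion dollars in 2006."
h = "Google acquired YouTube."
print(entails(t, h))
```

Because the task is heavily skewed towards NO, the choice of threshold effectively trades recall on the rare YES class against precision on the dominant NO class.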
Similar articles
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are at the core of Machine Translation (MT) engine development, as engines are refined through frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from the Lexical Similarity set on machine tra...
Regression for Sentence-Level MT Evaluation with Pseudo References
Many automatic evaluation metrics for machine translation (MT) rely on making comparisons to human translations, a resource that may not always be available. We present a method for developing sentence-level MT evaluation metrics that do not directly rely on human reference translations. Our metrics are developed using regression learning and are based on a set of weaker indicators of fluency a...
Sensitivity of Automated MT Evaluation Metrics on Higher Quality MT Output: BLEU vs Task-Based Evaluation Methods
We report the results of an experiment to assess the ability of automated MT evaluation metrics to remain sensitive to variations in MT quality as the average quality of the compared systems goes up. We compare two groups of metrics: those which measure the proximity of MT output to some reference translation, and those which evaluate the performance of some automated process on degraded MT out...
Combining Confidence Estimation and Reference-based Metrics for Segment-level MT Evaluation
We describe an effort to improve standard reference-based metrics for Machine Translation (MT) evaluation by enriching them with Confidence Estimation (CE) features and using a learning mechanism trained on human annotations. Reference-based MT evaluation metrics compare the system output against reference translations looking for overlaps at different levels (lexical, syntactic, and semantic)....
Choosing the Best MT Programs for CLIR Purposes - Can MT Metrics Be Helpful?
This paper describes the use of MT metrics for choosing the best candidates for MT-based query translation resources. Our main metric is METEOR, but we also use NIST and BLEU. The language pair of our evaluation is English-German, because MT metrics still do not offer very many language pairs for comparison. We evaluated translations of CLEF 2003 topics by four different MT programs with MT metrics a...
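As a rough illustration of this kind of metric-based comparison, the sketch below scores two hypothetical MT outputs against a single reference translation with METEOR, NIST, and BLEU using NLTK's implementations; the system names, sentences, and the choice of NLTK as the scoring library are assumptions made for the example, not details from the paper.

```python
# Hypothetical sketch: rank MT systems by scoring their output for the
# same source sentence against a reference with METEOR, NIST and BLEU.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.nist_score import sentence_nist
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR's synonym matching needs WordNet

# Invented reference and system outputs, pre-tokenized with a simple split.
reference = "the committee discussed the new trade agreement".split()
candidates = {
    "system_A": "the committee talked about the new trade deal".split(),
    "system_B": "committee new trade agreement discussed today also".split(),
}

smooth = SmoothingFunction().method1  # avoid zero BLEU on short sentences
for name, hyp in candidates.items():
    scores = {
        "METEOR": meteor_score([reference], hyp),
        "NIST": sentence_nist([reference], hyp),
        "BLEU": sentence_bleu([reference], hyp, smoothing_function=smooth),
    }
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```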